Air Paradis : Detect bad buzz with deep learning

Context

"Air Paradis" is an airline company who's marketing department wants to be able to detect quickly "bad buzz" on social networks, to be able to anticipate and address issues as fast as possible. They need an AI API that can detect "bad buzz" and predict the reason for it.

The goal here is to evaluate three approaches to detect "bad buzz":

Load project modules

The helper functions and project-specific code are placed in ../src/.

We will use the Python programming language, and present the code and results in this JupyterLab notebook.

We will use the usual libraries for data exploration, modeling and visualisation:

We will also use libraries specific to the goals of this project:

Exploratory data analysis (EDA)

We are going to load the data and analyse the distribution of each variable.

Load data

Let's download the data from the Kaggle "Sentiment140 dataset with 1.6 million tweets" page.

Now we can load the data.
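A minimal loading sketch with pandas. The file path is an assumption (the Kaggle archive ships as `training.1600000.processed.noemoticon.csv`); the file has no header row and is latin-1 encoded, with sentiment encoded as 0 (negative) / 4 (positive):

```python
import pandas as pd

# Sentiment140 ships without a header row; the column names below
# follow the Kaggle dataset description.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

def load_tweets(path):
    """Load the Sentiment140 CSV and keep only the columns we analyse."""
    df = pd.read_csv(path, encoding="latin-1", names=COLUMNS)
    df = df[["target", "text"]]
    # The raw file encodes sentiment as 0 (negative) / 4 (positive).
    df["target"] = df["target"].map({0: "NEGATIVE", 4: "POSITIVE"})
    return df

# df = load_tweets("../data/training.1600000.processed.noemoticon.csv")
```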

Explore data

Let's display a few examples, find out how many data points are available, what the variables are, and how they are distributed.

There are 1,600,000 rows, each composed of 6 columns:

We are only interested in the target and text variables. The rest of the columns are not useful for our analysis.

There are exactly as many POSITIVE tweets as NEGATIVE tweets (800,000 each), and no NEUTRAL tweets. The problem is well balanced, so training will not be biased towards one class.

There is no big difference in character count between POSITIVE and NEGATIVE tweets, although NEGATIVE tweets are slightly longer. In both classes, there are two modes: ~45 characters and 138 characters (the maximum allowed at the time).

The word counts tell a similar story: NEGATIVE tweets are noticeably longer than POSITIVE tweets, and both classes show two modes: ~7 words and ~20 words.
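The length statistics above can be computed with a small helper (a sketch; it assumes a DataFrame with the `target` and `text` columns described earlier):

```python
import pandas as pd

def length_stats(df):
    """Median character and word counts per sentiment class (sketch)."""
    out = df.assign(
        n_chars=df["text"].str.len(),
        n_words=df["text"].str.split().str.len(),
    )
    return out.groupby("target")[["n_chars", "n_words"]].median()
```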

Text analysis

We will look in more detail at what the text variable contains.

First, we will transform the dataset into a Bag of Words representation with TF-IDF (Term Frequency - Inverse Document Frequency) weights. To achieve this, we are going to use the spaCy tokenizer.

Our corpus is now transformed into a BoW representation. We can analyse the word frequencies.

We can see that the most important words are actually meaningful and relevant to the sentiment of each message.

We will use this representation to build a classification model in order to predict the sentiment of a new message.

Classification model

We are going to train and evaluate a classification model to predict the sentiment of a new message.

Dimension reduction & Topic modeling

First, we need to reduce the dimensionality of the BoW representation: there are 240,589 words in the corpus. We use the Latent Semantic Analysis (LSA) method, which builds topics that are most relevant to the corpus.

The elbow method should help us choose the number of topics, but there is no clear elbow here, so we choose 50 topics.

The dataset is now reduced to 50 topics. We can observe the composition (most relevant words) of each topic.

We can identify the following topics:

Train and test the model

We can now train and test a classification model. We are going to use the Logistic Regression model.
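A minimal training sketch on the 50 topic features (variable names such as `X_topics` and the 0/1 label encoding are assumptions; the split parameters are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumes X_topics (n_tweets, 50) from the LSA step and y with
# 0 = NEGATIVE, 1 = POSITIVE.
# X_train, X_test, y_train, y_test = train_test_split(
#     X_topics, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
# clf.fit(X_train, y_train)
# clf.coef_[0] ranks topics: large positive coefficients push towards
# POSITIVE, large negative ones towards NEGATIVE.
```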

Once the model is trained, we can observe which topics are the most relevant to the sentiment of the messages.

The most NEGATIVE topics are:

The most POSITIVE topics are:

Now we can measure the performance of our model. We are going to use the Confusion Matrix, the Precision-Recall curve (Average Precision metric) and the ROC curve (ROC AUC metric) to evaluate our model.
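The three evaluations can be gathered in one helper (a sketch; the function name is an assumption, and it expects 0/1 labels and a fitted classifier with `predict_proba`):

```python
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)

def evaluate(clf, X, y):
    """Confusion matrix, Average Precision and ROC AUC (sketch)."""
    scores = clf.predict_proba(X)[:, 1]  # probability of the POSITIVE class
    preds = clf.predict(X)
    return {
        "confusion_matrix": confusion_matrix(y, preds),
        "average_precision": average_precision_score(y, scores),
        "roc_auc": roc_auc_score(y, scores),
    }
```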

The performance on the train and test datasets is nearly identical, so the model is neither over- nor under-fitting.

The performance is quite reasonable for a baseline model:

Our model is biased towards the POSITIVE class: it predicted 35% more POSITIVE messages (918,049) than NEGATIVE ones (681,951).

Let's observe some classification errors.
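A sketch for pulling out misclassified examples (the helper name and the aligned-arrays interface are assumptions; labels are 1 = POSITIVE, 0 = NEGATIVE):

```python
import numpy as np

def misclassified(texts, y_true, y_pred, kind="fp", n=5):
    """Sample false positives ('fp') or false negatives ('fn') (sketch).

    texts, y_true and y_pred are aligned sequences over the test set.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if kind == "fp":
        mask = (y_pred == 1) & (y_true == 0)  # predicted POSITIVE, was NEGATIVE
    else:
        mask = (y_pred == 0) & (y_true == 1)  # predicted NEGATIVE, was POSITIVE
    return [texts[i] for i in np.flatnonzero(mask)[:n]]
```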

On this false-positive example, the model fails to recognise the NEGATIVE sentiment, despite words like "bummer" and "should"...

On this false-negative example, the model again predicts the wrong sentiment: it is fooled by negative-sounding words like "sick", "cheap" and "hurts" appearing in an otherwise POSITIVE message.